Abstract

The United States Air Force (USAF) officer sustainment system involves making
accession and promotion decisions for nearly 64 thousand officers annually. We
formulate a discrete-time stochastic Markov decision process model to examine this
military workforce planning problem. The large size of the motivating problem suggests
that conventional exact dynamic programming algorithms are inappropriate. As such, we
propose two approximate dynamic programming (ADP) algorithms to solve the problem.
We employ a least-squares approximate policy iteration (API) algorithm with
instrumental variables Bellman error minimization to determine approximate policies.
In this API algorithm, we use a modified version of the Bellman equation based on the
post-decision state variable. Approximating the value function using a post-decision
state variable allows us to find the best policy for a given approximation using a
decomposable mixed integer nonlinear programming formulation. We also propose
an approximate value iteration algorithm using concave adaptive value estimation
(CAVE). The CAVE algorithm identifies an improved policy for a test problem based
on the current USAF officer sustainment system. The CAVE algorithm obtains a
statistically significant 2.8% improvement over the currently employed USAF policy,
which serves as the benchmark.

APPROXIMATE DYNAMIC PROGRAMMING ALGORITHMS FOR
UNITED  STATES  AIR  FORCE  OFFICER  SUSTAINMENT 

I.  Introduction 

“The basic manpower problem is the following: Determine the number of
personnel and their skills that best meets the future operational
requirements of an enterprise.” (Gass, 1991)

The United States Air Force (USAF) comprises approximately 330,000 personnel
who enhance national security by providing the distinctive capabilities of air
and space superiority, global attack, rapid global mobility, precision engagement,
information superiority, and agile combat support to the Department of Defense (DoD).
The USAF, like the other branches of the military, consists of commissioned
officers as well as the enlisted force. These two groups exhibit significantly different
behaviors with regard to retention, promotion, and cross-flow between career fields.
This research investigates and attempts to discover improved policies regarding
management of the commissioned officer corps.

The USAF must recruit, train, and develop its personnel using limited resources.
Over the last several years, the drawdown from Operations Enduring Freedom and
Iraqi Freedom as well as shifting domestic priorities have resulted in significant cuts
to current and future outlays for acquisitions, operations, and personnel budgets.
Difficult fiscal conditions emphasize the importance of having the correct mix of
personnel to field a ready force. The USAF must balance the needs for officers of
varying levels of experience within 90 career fields ranging from personnel officers
to fighter pilots. Each of these career fields is labeled with an Air Force Specialty
Code (AFSC). The USAF has a known set of requirements (i.e., demand) and known
Congressionally-mandated force size constraints.

Grades range from O-1 to O-10, with the grades O-7, O-8, O-9, and O-10
corresponding to the ranks of General Officers. The model proposed in this thesis only
considers grades from O-1 to O-6, which account for the vast majority of officers in
the officer sustainment problem. Current USAF policy is a 100% promotion
rate from O-1 to O-2 and from O-2 to O-3. This policy is primarily due to long
training times and a limited performance record with which to differentiate junior
officers at these grades. Moreover, officers at the grades of O-1 and O-2 frequently fill
O-3 requirements. A complicating feature of the manpower planning problem faced
by the USAF is the fact that senior officers are developed from junior officers only,
with every recruited officer starting at the grade of O-1.

Due to these characteristics, poor personnel management decisions can have
far-reaching impacts and corrective actions can take a significant amount of time to
take effect. Economic factors and changes within the military environment such
as operations tempo, salary, and benefits such as health care and combat pay can
significantly impact retention rates (Asch et al., 2008; Murray, 2004). This can produce
significant deviation in retention rates over time, resulting in an uncertain
supply of officers to meet USAF personnel requirements. These factors make it
difficult to predict how a force structure will develop and progress.

The current USAF personnel system determines the number of requirements for
each AFSC and grade combination. Additionally, ten years of historical data are
used to calculate retention rates for each combination of AFSC and number of
commissioned years of service (CYOS). This information informs the development of a
retention line. This retention line assumes a deterministic retention based on
historical observations and no future deviation. The current model used to optimize
accessions smooths the total requirements for each AFSC over the projected
retention line. Promotion decisions are made independently from the force optimization
model, so the accession decisions and promotion decisions are not tied together by
means of a holistic policy. Because the static optimization model addresses only
accession policies, the resulting policies cannot respond to deviations from expected
outcomes over the 30-year window. Over time, these deviations can become large,
which has historically resulted in the use of significant measures to boost or lower
retention such as paying bonuses to keep people in (Lakhani, 1988; Simon & Warner,
2009) or reductions in force (RIFs) to decrease the size of the force. Deviations
compounded by changes in the desired force structure during times of build-up or
downsizing can exacerbate the level of correction needed.

While paying bonuses has an easily calculable cost, RIFs have more subtle costs.
Mone (1994) discovered that in a steady state organization, persons with low
self-efficacy are significantly more likely to depart the organization than the higher
performing persons with high self-efficacy. However, when an organization is actively
downsizing, the correlation is reversed. The high performers begin leaving
significantly more frequently than the lower performers. Additionally, Wong & McNally
(1994) showed during the force reductions in the US Army in the 1990s that
organizational commitment decreased significantly in the survivors of the RIFs, even when
the primary means used to trim the force were voluntary separations.

We  formulate  a  Markov  decision  process  (MDP)  model  to  examine  the  Air  Force’s 
officer  sustainment  problem.  MDPs  have  several  features  that  make  them  particularly 
suitable  for  this  sort  of  workforce  planning  problem.  An  MDP  can  provide  policies 
that  are  state-dependent,  which  allows  for  a  workforce  system  to  correct  over  time. 
State  transitions  can  be  modeled  stochastically,  allowing  the  MDP  model  to  address 
the  uncertainty  inherent  in  the  personnel  system. 

The state of the system for this problem is found by aggregating individual officers
by class descriptors. The three class descriptors for the USAF officer sustainment
system are career field (AFSC), commissioned years of service (CYOS), and grade (i.e.,
rank). The state represents the current total stock of officers, categorized by each
possible combination of class descriptors. In each time period, individuals
deterministically maintain their career field, stochastically transition either out of the system
or to the next CYOS according to a retention parameter, and stochastically
transition either to the next grade or remain in their current grade according to promotion
rate decisions made within the model. The model provides a policy, $\pi$, that, given the
current state, indicates the number of officers to recruit for each career field (i.e.,
accessions) as well as the percentage of officers within specified promotion windows
to be promoted. A single Line of the Air Force competitive category is examined, so
promotion policies apply to officers in a specified promotion window across all career
fields within the model. The contribution function imposes a cost for shortages of
officers by career field and grade as well as a cost for exceeding the maximum
number of allowable officers. These costs are weighted to reflect the criticality of certain
AFSC and grade combinations.
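
The transition dynamics described above can be sketched for a single career field. This is a minimal illustration rather than the thesis model: the function names, the per-grade promotion fractions (the text uses promotion windows), and all rates are assumptions chosen only to show the retention and promotion draws.

```python
import random

def binomial(n, p, rng):
    """Number of successes in n independent trials with probability p."""
    return sum(1 for _ in range(n) if rng.random() < p)

def step_cohort(stock, retention, promote_frac, accessions, max_cyos, max_grade, rng):
    """Advance one career field by one period.

    stock maps (cyos, grade) -> officer count. retention[cyos] is the
    probability of staying; promote_frac[grade] is the promotion decision.
    All parameter values are hypothetical.
    """
    nxt = {(0, 1): accessions}  # every new officer enters at the lowest grade
    for (cyos, grade), count in stock.items():
        if cyos >= max_cyos:
            continue  # officers in the final CYOS group leave the system
        stayed = binomial(count, retention[cyos], rng)
        p = promote_frac.get(grade, 0.0) if grade < max_grade else 0.0
        promoted = binomial(stayed, p, rng)
        for g, n in ((grade + 1, promoted), (grade, stayed - promoted)):
            if n:
                key = (cyos + 1, g)
                nxt[key] = nxt.get(key, 0) + n
    return nxt

rng = random.Random(42)
stock = {(0, 1): 100, (1, 1): 90, (2, 2): 80}
retention = [0.95, 0.93, 0.90, 0.88]
print(step_cohort(stock, retention, {1: 1.0, 2: 0.9}, 110, 3, 6, rng))
```

Because the career field never changes, a full model could apply a step of this kind to each AFSC independently and couple the fields only through the shared promotion decision and the contribution function.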

The state space of the motivating problem of interest has 9,720 dimensions
representing the full 54 Line of the Air Force AFSCs, 30 CYOS groups, and 6 grades. This
level of dimensionality combined with the stochastic nature of the state transitions
makes determining a stationary policy computationally intractable. The size of the
problem suggests that development of an exact dynamic programming algorithm to
obtain a solution is inappropriate.
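
The dimension count quoted above, and the reason exact enumeration is hopeless, can be checked with a few lines of arithmetic. The `count_states` helper is an illustrative assumption: it counts the ways to distribute a fixed number of identical officers over the state cells, which is one simple way to gauge the size of the state space.

```python
from math import comb

n_afsc, n_cyos, n_grades = 54, 30, 6
dims = n_afsc * n_cyos * n_grades
print(dims)  # 9720

def count_states(total_officers, cells):
    """Ways to place identical officers into cells (stars and bars)."""
    return comb(total_officers + cells - 1, cells - 1)

# Even a toy problem grows quickly: 10 officers over 3 cells gives 66 states.
print(count_states(10, 3))  # 66
# Number of decimal digits in the state count for 100 officers over 12 cells.
print(len(str(count_states(100, 12))))
```

With 9,720 cells and tens of thousands of officers the same count is astronomically large, which is why a table of exact state values cannot be stored or computed.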

In order to address the large size of the problem, two approximate dynamic
programming (ADP) algorithms are developed to obtain non-optimal but high-quality
solutions. The first proposed algorithm uses least squares temporal differences with
Bellman error minimization in an approximate policy iteration framework to obtain
policies. As part of the process, we simulate potential post-decision states and
observe the value of a possible outcome of being in that state. After a batch of these
observations is simulated, a regression is performed utilizing instrumental variables to
minimize Bellman error. This algorithm uses a set of basis functions to approximate
the value of the post-decision state. This approximation scheme allows the
formulation of a non-linear mixed-integer program to solve the inner maximization problem,
obtaining optimal actions based on the current approximation. Algorithm variants
using instrumental variables regression and Latin hypercube sampling are examined.

We also use the Concave, Adaptive Value Estimation (CAVE) algorithm (Godfrey
& Powell, 2002) to develop separable piecewise linear value approximations that
represent the ‘cost to go’ value function for a finite-horizon formulation of the problem.
This algorithm simulates potential outcomes of a given policy and uses the outcomes
to update the estimate of the gradients of the value function approximation. The
algorithm takes advantage of known problem structure to efficiently update the value
function approximation of large numbers of policies simultaneously.

The remainder of this thesis is organized as follows. In Chapter 2, a detailed
background of the USAF officer sustainment problem and relevant techniques and models
is provided. Chapter 3 provides two problem formulations and the methodology used
to develop and evaluate the model. Chapter 4 describes and compares the results of
these models. Chapter 5 draws conclusions from these results.

II. Literature Review


This chapter examines prior manpower and workforce planning models, force
management issues, Markov decision process theory, and approximate dynamic
programming (ADP) theory. Prior manpower models have not typically used Markov decision
processes to model large scale workforce planning problems due to the
computationally intensive nature of the necessary calculations. Markov decision processes (MDPs)
have been used for years by those in the operations research community to address
smaller discrete stochastic problems (Puterman, 1994). However, only recently has
the potential of this construct begun to be realized. With the advent of approximate
dynamic programming, the MDP construct can be applied to large, high-dimension
problems (Powell, 2009). Application of these techniques to highly dimensional
manpower and personnel problems has thus far been limited, although applications to
similar resource allocation problems are well documented.

2.1  Manpower  Planning  Models 

Markov  Chain  Models. 

The application of Markov chain theory is common when modeling dynamic
behavior in discrete-time, push-flow manpower systems, where transitions are
determined by fixed rates from the originating state, as opposed to pull-flow manpower
systems where transitions are determined by vacancies to fill. Markov chains are
typically used as descriptive models due to the lack of any mathematical programming
(Wang, 2005). The result of this simplicity is that the system cannot dynamically
alter these transitions internally if the result is unsatisfactory. In effect, the system
models a single fixed policy.
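
A push-flow Markov chain projection of the kind described above can be sketched as repeated multiplication of a stock vector by a fixed transition matrix plus a fixed recruitment vector. The two-grade structure and all rates below are illustrative assumptions, not data from any cited model.

```python
def project(stock, transition, recruits, periods):
    """Project stocks with n(t+1) = n(t) P + r under fixed transition rates.

    transition[i][j] is the fraction of grade i moving to grade j each
    period; rows may sum to less than one, the remainder being attrition.
    """
    history = [list(stock)]
    k = len(stock)
    for _ in range(periods):
        stock = [
            sum(stock[i] * transition[i][j] for i in range(k)) + recruits[j]
            for j in range(k)
        ]
        history.append(stock)
    return history

# Hypothetical two-grade system: half of the junior grade stays, a quarter is
# promoted, and the rest leaves; half of the senior grade stays.
P = [[0.5, 0.25],
     [0.0, 0.5]]
for row in project([100.0, 0.0], P, [10.0, 0.0], 3):
    print(row)
```

Since `P` and `recruits` never change, the projection describes the consequences of one fixed policy and offers no mechanism for reacting to an unsatisfactory trajectory.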

Skulj et al. (2008) apply Markov chain theory to the Slovenian military manpower
problem. Skulj identified and tracked transition rates between states for five years,
then identified which transition rates could be adjusted. An iterative simulation was
run, during which a set of transitions (actions) was identified that resulted in a close
match to the desired force structure within four years. However, no static set of
transition functions was found to maintain the force structure at the desired levels.

Filinkov et al. (2011) use Markov chains to determine how sustainable current
Australian military force levels are for different possible future deployment
requirements. Markov chains are used to model how a force develops over time given a
current set of policies. A series of deterministic equations is used to determine
the cost of a given outcome in terms of being able to sustain operational demands
associated with a variety of future scenarios.

McClean (1991) compares the application of Markov chains to semi-Markov
renewal models. McClean notes that the Markov chain approach is oversimplified in that
it is limited to modeling a push system and requires assumptions that the probability
distributions of the underlying transitions are geometric or exponential. However,
renewal theory approaches are limited to modeling pull systems with constant grade
sizes and are mathematically intractable when applied to reasonably sized problems.

Kinstler  et  al.  (2008)  develop  a  Markov  chain  model  to  analyze  the  US  Navy 
nursing  corps.  Specific  recruiting  guidelines  are  identified  as  the  cause  of  specified 
rank  imbalances.  Tradeoffs  between  levels  of  rank  imbalance  and  the  ability  to  recruit 
different  ranks  into  the  nursing  corps  are  explored. 

Simulation  Models. 

Simulation models are a valuable tool for assessing the results of different
behaviors or policies within a system that may be too complex for analytical models such
as Markov chains. However, like Markov chains, simulations are primarily a
descriptive tool instead of an optimizing tool. Moreover, stochastic simulations often face
significant problems with auto-correlation in the output which may require advanced
statistical analysis to interpret, given the variability internal to the model (Wang,
2005).

McGinnis et al. (1994) construct a simulation of the US Army officer professional
development system to determine impacts of changes to the legal requirements for
promotion due to Title IV of the Defense Reorganization Act of 1986. Their
simulation is designed to evaluate potential changes to laws, policies, or internal force
structure actions and inform senior personnel leaders of the consequences of these
changes.

RAND  uses  the  PILOT  model  to  simulate  how  pilots  are  developed  and  trained 
(Mooz,  1969).  The  model  determines  the  impacts  of  changing  pilot  demand  levels  for 
different  aircraft  systems.  When  the  requirements  for  pilots  of  one  type  of  aircraft 
increase,  some  pilots  are  transitioned  from  other  aircraft,  requiring  different  levels 
of  retraining  to  become  proficient.  These  transitions  result  in  increased  demand  for 
pilots  from  both  the  aircraft  being  expanded  as  well  as  the  aircraft  supplying  more 
experienced  pilots.  The  PILOT  model  uses  simulation  to  determine  what  resources 
are  required  in  terms  of  cost,  training  crews,  and  training  aircraft  in  order  to  meet 
potential  levels  of  demand. 

Network  Flow  Models. 

Network flow models are another useful tool for lower dimensional problems (Gass,
1991). Network flow models have the benefit of being easy to visualize and
comprehend for the non-technical decision maker. Gass (1991) shows that several manpower
problems can be formulated as a general minimal-cost transshipment (flow) network.
A significant limitation of this approach is that while different flows may enter a node,
these flows have only a common identifier when exiting the node. This results in a
level of aggregation that may not be acceptable for all problem sets.

Mulvey (1979) compares the results of applying a network flow model, an integer
program, and a simplified aggregate model to a manpower scheduling problem. The
integer program is shown to require more information to implement than the
network model; however, the integer program handles a wider variety of scenarios. The
network flow model is easier to implement and simpler computationally.

System  Dynamics  Models. 

System  dynamics  (SD)  models  allow  the  examination  of  continuous  flows  over  time 
by  incorporating  feedback  loops  to  model  interactions  between  different  structures. 
This  technique  relies  heavily  on  the  development  of  the  appropriate  equations  to 
model  the  system  (Wang,  2005). 

Thomas  et  al.  (1997)  provide  a  system  dynamics  model  as  an  alternative  to  a 
suite  of  tools  used  for  US  Army  enlisted  personnel  management.  A  key  benefit  of 
this  approach  is  the  ability  to  generate  causal  loop  diagrams  to  verify  elemental 
assumptions  within  the  model.  However,  this  model  is  limited  to  analyzing  aggregated 
behavior  instead  of  differentiating  by  career  field,  grade,  and  time  in  service,  which 
would  significantly  increase  the  level  of  intricacy  of  the  required  equations.  The 
system  dynamics  approach  for  aggregated  enlisted  behavior  is  validated  by  replication 
of  historical  scenarios. 

Optimization  Models. 

Optimization techniques such as linear programming (LP), integer programming
(IP), goal programming (GP), and dynamic programming (DP) provide methods
to find optimal solutions, but face significant limitations in their ability to handle
certain classes of problems (Wang, 2005). Linear programming and integer
programming are limited to minimizing or maximizing a single objective function and can be
limited in their ability to handle stochastic problems. Dynamic programming, often
formulated as a Markov decision process, is a powerful tool for handling sequential
decision making under uncertainty, but can require significant expertise to formulate
(Wang, 2005) and is computationally intensive for most real world problems (Powell,
2012). Dynamic programming has historically performed well for problems requiring
a sequential allocation of resources.

Workman  (2009)  builds  a  linear  program  to  optimize  recruitment  and  promotion 
in  order  to  develop  an  indigenous  security  force.  This  model  replaces  basic  heuristics 
used  previously  with  a  single  model  that  incorporates  the  entire  enlisted  and  officer 
force  across  several  scenarios.  The  model  is  demonstrated  with  current  data  from 
the  Afghan  National  Army  and  provides  key  insights  on  the  feasibility  of  potential 
courses  of  action. 

The Manpower Long Range Planning System (MLRPS) (Gass et al., 1988) uses
Markov chains and linear programming to project and optimize the strength of the
US Army over planning horizons of 10 and 20 years. This model has been used
by the Army Office of the Deputy Chief of Staff of Personnel to determine which
policy changes are required in order to meet force structure requirements. The
20-year horizon Manpower Planning Model differentiates the force according to grade
and time in service indices and calculates the optimal long term policies to shape the
total force. The 10-year horizon Manpower Requirements Model differentiates the
force according to grade and skill indices and calculates the optimal policies to meet
career field demands over the intermediate term.

Corbett (1995) constructs the Officer Accession/Branch Detail Model (OA/BDM),
a goal programming model to optimize accessions for the US Army’s Officer Personnel
Management Directorate. Analysis of outputs from OA/BDM indicates that
assumptions and the corresponding recommendations of the previous manpower planning
program are optimistically biased. OA/BDM is also used to analyze potential courses
of action to correct unbalanced overages of junior officers.

The Accession Supply Costing and Requirements Model (ASCAR) is a goal
programming model that optimizes enlisted recruitment into the US Armed Forces across
all branches of service (Collins et al., 1983). ASCAR was used by the
Congressional Budget Office and the Office of the Secretary of Defense to successfully predict
whether the military would be able to simultaneously meet personnel quality goals
and total end strength goals as it transitioned to an all-volunteer force.

Charnes et al. (1972) utilize a goal programming model for General Schedule
civilian manpower management in the US Navy. This goal programming model
optimizes recruitment based on requirements and cost and significantly improves the
ability to optimize these factors in the presence of truncational effects such as
retirement. Charnes et al. (1972) demonstrate their results with a hypothetical numerical
illustration.

Dimitriou et al. (2013) use multivariate Markov chains to model a workforce
transitioning within and between different departments or subgroups in a hierarchical
personnel structure. This construct allows the user to evaluate the consequences of
different training courses and preparation classes. Goal programming techniques are
then used to achieve a desirable structure at minimum cost.

Grinold (1976) uses dynamic programming with embedded Markov chains to
construct an optimal policy for naval aviator recruitment. Markov chains determine the
future demand for personnel. To optimize the decisions to meet forecasted demands,
Grinold (1976) uses a linear-quadratic optimal control problem, which is a special
form of Markov decision process. The linear-quadratic optimal control problem
produces an optimal decision rule for any finite planning horizon that is a linear function
of existing manpower stocks.

Approximate  Dynamic  Programming. 

Approximate dynamic programming extends many of the benefits of dynamic
programming to problems where solving the DP is computationally intractable (Powell,
2012). Approximate dynamic programming uses Monte Carlo simulation to sample
possible outcomes of a given model and find an approximate solution using an
estimated future cost function. Many approximate dynamic programs have been proven
to converge to the true optimal solution if the number of sampling iterations is
sufficiently large.

Song  &  Huang  (2008)  use  a  basic  Successive  Convex  Approximation  Method 
(SCAM)  as  well  as  a  modified  SCAM  algorithm  to  approximate  value  functions  for 
a  multi-stage  workforce  problem  with  stochastic  demand.  Their  algorithm  creates  a 
piecewise  linear  approximation  of  the  value  of  hiring,  firing,  or  transferring  personnel 
for  different  departments  with  stochastic  flows  between  departments  and  out  of  the 
system.  The  algorithm  is  shown  to  provide  solutions  with  near-optimal  values  for 
problems  small  enough  to  be  solved  exactly  to  make  the  solution  quality  comparison. 

2.2  Approximate  Dynamic  Programming  Techniques 

Dynamic programming has been demonstrated to optimize stochastic problems
exceptionally well. The foundation of dynamic programming is the Bellman
equation, which provides the basis for determining actions, $x$, that maximize expected
immediate contributions due to the state of the system at time $t$, $S_t$, and actions
taken, $C(S_t, x)$, as well as expected future contributions (Bellman, 1955). The value
of these expected future contributions is captured by the expected value of the
potential states to which the system may transition, $\mathbb{E}[V(S_{t+1}) \mid S_t]$.
The discount factor, $\gamma$, is used to weight the value of immediate contributions
compared to contributions in the future. The Bellman equation is as follows:

$$V(S_t) = \max_{x} \mathbb{E}\left[ C(S_t, x) + \gamma V(S_{t+1}) \mid S_t \right], \tag{1}$$

where $V(S_t)$ is the value of the current state.
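
For a problem small enough to enumerate, Equation (1) can be solved by successive approximation (value iteration). The sketch below is only an illustration: the two-state manning example, its rewards, and its transition probabilities are invented for the purpose and do not come from the thesis.

```python
def value_iteration(states, actions, prob, reward, gamma=0.9, tol=1e-9):
    """Iterate V(s) = max_a [C(s, a) + gamma * E V(s')] until convergence.

    prob[s][a] is a dict {next_state: probability}; reward[s][a] is C(s, a),
    assumed deterministic given the state and action.
    """
    V = {s: 0.0 for s in states}
    while True:
        new_V = {
            s: max(
                reward[s][a]
                + gamma * sum(p * V[s2] for s2, p in prob[s][a].items())
                for a in actions
            )
            for s in states
        }
        if max(abs(new_V[s] - V[s]) for s in states) < tol:
            return new_V
        V = new_V

# Hypothetical example: the force is either "short" or "full" of officers.
states = ["short", "full"]
actions = ["hold", "recruit"]
reward = {"short": {"hold": -5.0, "recruit": -6.0},
          "full": {"hold": 0.0, "recruit": -1.0}}
prob = {"short": {"hold": {"short": 1.0}, "recruit": {"full": 0.8, "short": 0.2}},
        "full": {"hold": {"full": 0.7, "short": 0.3}, "recruit": {"full": 1.0}}}
print(value_iteration(states, actions, prob, reward))
```

Each sweep touches every state, every action, and every possible next state, which is exactly the enumeration that becomes infeasible as the number of dimensions grows.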

The primary drawback of dynamic programming for large scale manpower
applications is the complexity inherent in computing expectations in the presence of
multi-dimensional data. Problems with computational complexity due to high-dimensional
data are commonly referred to as the “curse of dimensionality” (Bellman, 1957) or
alternately the “three curses of dimensionality”. The three curses refer to problems
arising from dimensionality in the state space, action space, and outcome space of a
problem. Since dynamic programming relies on explicitly computing every possible
combination of events and the value associated with this occurrence, even small levels
of dimensionality can result in significant computational problems.

Approximate dynamic programming techniques are applied to solve many
problems that are well suited to a dynamic programming approach, but are
computationally intractable. Several techniques used in combination allow for near-optimal
solutions to problems that cannot be fully evaluated by dynamic programming
algorithms. ADP can resolve problems arising from dimensionality by utilizing a sampling
technique to estimate an approximation of the value of a policy instead of finding the
exact expectation of the value (Bellman & Dreyfus, 1959; Powell, 2009).

A critical difference between dynamic programming and approximate dynamic
programming is the use by ADP of a forward pass with Monte Carlo simulation of
possible random outcomes to determine values of states and actions. ADP provides
an approximate valuation and near-optimal solution, though many algorithms have
been proven to converge to an optimal solution if the states are sampled sufficiently
often (Ahner & Parson, 2014; Powell, 2007).
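
The trade between an exact expectation and a Monte Carlo estimate can be seen on a single cohort. In this minimal sketch, which assumes binomial retention and a hypothetical shortage cost, the exact calculation sums over every possible outcome while the sampled version averages a fixed number of simulated outcomes.

```python
import random
from math import comb

def exact_expectation(n, p, value):
    """Sum value(k) over every possible number k of retained officers."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) * value(k)
               for k in range(n + 1))

def sampled_expectation(n, p, value, samples, rng):
    """Average value(k) over simulated retention outcomes."""
    total = 0.0
    for _ in range(samples):
        k = sum(1 for _ in range(n) if rng.random() < p)
        total += value(k)
    return total / samples

def shortage_cost(k, target=18):
    """Hypothetical cost of falling below a requirement of `target` officers."""
    return float(max(target - k, 0))

rng = random.Random(7)
print(exact_expectation(20, 0.9, shortage_cost))
print(sampled_expectation(20, 0.9, shortage_cost, 2000, rng))
```

For one cohort the exact sum is cheap, but the number of joint outcomes multiplies across cohorts, whereas the cost of the sampled estimate depends only on the number of samples drawn.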

While Monte Carlo simulation relieves the problem of exactly solving for values
of each state and action combination, many problems still experience significant
combinatorial effects even from approximately solving for those values. Fortunately, by
adapting the formulation of the problem to use a post-decision state, $S_t^x$, much of this
combinatorial effect can be alleviated. The post-decision state can more compactly
represent the possible outcomes without losing any information for many problems
(Van Roy et al., 1997). The value of the post-decision state, $V^x(S_t^x)$, is expressed as
follows:

$$V^x(S_t^x) = \mathbb{E}\left[ V(S_{t+1}) \mid S_t^x \right]. \tag{2}$$


The following adaptation to the Bellman equation is made to make use of this powerful
concept:

$$V^x_{t-1}(S^x_{t-1}) = \mathbb{E}\left[ \max_{x} \left( C(S_t, x) + \gamma V^x_t(S^x_t) \right) \,\middle|\, S^x_{t-1} \right]. \tag{3}$$
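
A sample-based use of Equation (3) can be sketched with a lookup table over post-decision states. This is a generic approximate value iteration loop under assumed dynamics (a single stock of officers, binomial retention, a unit shortage cost, and a small accession cost), not the algorithm developed in this thesis. The point it illustrates is that the inner maximization contains no expectation, because the table entry for the post-decision state already averages over future randomness.

```python
import random

def adp_post_decision(target=5, max_stock=8, retain=0.8, gamma=0.9,
                      alpha=0.1, explore=0.1, iters=5000, seed=0):
    """Estimate post-decision state values by forward simulation."""
    rng = random.Random(seed)
    v_post = [0.0] * (max_stock + 1)  # table for the post-decision value
    post = target                     # stock after the previous decision
    for _ in range(iters):
        # Exogenous step: each officer is retained with probability `retain`.
        state = sum(1 for _ in range(post) if rng.random() < retain)

        def q(x):
            # Contribution plus discounted post-decision value; deterministic.
            shortage = max(target - state, 0)
            return -shortage - 0.1 * x + gamma * v_post[state + x]

        choices = range(max_stock - state + 1)
        best_x = max(choices, key=q)
        # The sampled value updates the PREVIOUS post-decision state.
        v_post[post] = (1 - alpha) * v_post[post] + alpha * q(best_x)
        # Occasionally explore so that more post-decision states are visited.
        x = rng.choice(list(choices)) if rng.random() < explore else best_x
        post = state + x
    return v_post

print([round(v, 2) for v in adp_post_decision()])
```

The smoothing step plays the role of the outer expectation in Equation (3): repeated samples of the maximized quantity are averaged for each post-decision state that is visited.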


For highly dimensional problems, assigning values to every state is not
computationally feasible, even with the advantages described above. To circumvent this
limitation, value function approximations are used to compactly represent values of
a large number of states. Approximate dynamic programming algorithms use these
functions to take advantage of structure within the problem and efficiently learn about
a large number of states without visiting each possible outcome. Using a value
function approximation in combination with approximate policy iteration is an
effective method to learn about the value of different states while periodically using that
information to improve the policy (Powell, 2012).

Approximate  Policy  Iteration  (API).

Approximate  policy  iteration  fixes  a  specified  policy  and  evaluates  potential  states 
and  outcomes  associated  with  that  policy.  This  information  is  then  used  to  update 
the  policy  and  reevaluate.  API  using  parametric  modeling  with  linear  basis  functions 
has  received  a  great  deal  of  attention  for  its  ability  to  use  linear  regression  to  derive 
information from a series of observations. These basis functions, also known as
independent variables, covariates, or features, must be carefully selected for the
algorithm to effectively approximate the value function (Bertsekas & Tsitsiklis, 1995).
The selected set of basis functions is denoted $\mathcal{F}$, with individual features
$f \in \mathcal{F}$. The crux of parametric modeling is to project the true value
function onto the space spanned by the basis functions. If the basis functions are not
appropriate to the problem and the true function does not lie within the span of the
basis functions, the approximation cannot converge to the true value function
(Scott et al., 2014). Use of these basis
functions  results  in  the  following  modification  to  the  value  of  the  post-decision  state 
in  the  Bellman  equation: 


$$\bar{V}^x(S_t^x) = \sum_{f \in \mathcal{F}} \theta_f \phi_f(S_t^x) = \theta^T \phi(S_t^x),$$  (4)

where $\theta$ is the vector of weights, $(\theta_f)_{f \in \mathcal{F}}$, associated with the basis functions. The
vector of basis function values for a post-decision state $S_t^x$ is then denoted $\phi(S_t^x)$.

When using the parametric modeling approach, the selection of the optimal decision
is adapted. For a given weight vector $\theta$, the policy $X^\pi(S_t \mid \theta)$ is given by:

$$X^\pi(S_t \mid \theta) = \arg\max_{x} \left[ C(S_t, x) + \gamma \theta^T \phi(S_t^x) \right].$$  (5)

Equation 5 is referred to as the inner maximization problem. This problem
is solved by a linear or nonlinear program, depending on whether any nonlinear
features have been selected as a part of $\phi(S_t^x)$. In the case of a cost function as
opposed to the contribution function outlined above, the problem is treated as a
minimization problem and the equation is modified as follows:

$$X^\pi(S_t \mid \theta) = \arg\min_{x} \left[ C(S_t, x) + \gamma \theta^T \phi(S_t^x) \right].$$  (6)

Least squares temporal differences with Bellman error minimization is an extension
of parametric modeling for infinite horizon problems in which the contributions
of a number of potential states are evaluated while a policy is fixed according to an
initial estimate of a set of parameters, $\theta$ (Bradtke & Barto, 1996). Lagoudakis & Parr
(2003) extend this algorithm to use a linear architecture to approximate the values of
state-action pairs in high-dimensional problems. The Bellman error captures the
difference between the value function approximation and the observed values being
approximated. This method evaluates the value of being in a state as the observed
contribution summed with the discounted value of the resulting post-decision state
based on the parameter $\theta$ (Scott et al., 2014).

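The matrix $\Phi_{t-1}$ (see Equation 7) records the numerical values of the basis functions
for N sampled post-decision states, while the matrix $\Phi_t$ records the basis function
values of the resulting post-decision states after information has been received and a
policy has been implemented. Additionally, the observed contributions are recorded in a
vector $C_t$:

$$\Phi_{t-1} = \begin{bmatrix} \phi(S^x_{t-1,1})^T \\ \vdots \\ \phi(S^x_{t-1,N})^T \end{bmatrix}, \quad \Phi_t = \begin{bmatrix} \phi(S^x_{t,1})^T \\ \vdots \\ \phi(S^x_{t,N})^T \end{bmatrix}, \quad C_t = \begin{bmatrix} C_{t,1} \\ \vdots \\ C_{t,N} \end{bmatrix}.$$  (7)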

Ordinary least squares regression is performed to identify the impact each feature
has on the observed contributions:

$$\hat{\theta} = \left[ (\Phi_{t-1} - \gamma \Phi_t)^T (\Phi_{t-1} - \gamma \Phi_t) \right]^{-1} (\Phi_{t-1} - \gamma \Phi_t)^T C_t.$$  (8)

Least squares approximate policy iteration with instrumental variables Bellman
error minimization is an efficient technique to obtain a consistent estimate of $\theta$
without modeling the noise of the system (Bradtke & Barto, 1996; Scott et al., 2014).
Instrumental variables are used to reduce noise within an approximate policy iteration
algorithm because they are correlated with the regressor and uncorrelated with
the errors and observations (Ma & Powell, 2010). Equation 9 shows the modified
instrumental variables regression equation as presented by Scott et al. (2014).

$$\hat{\theta} = \left[ (\Phi_{t-1})^T (\Phi_{t-1} - \gamma \Phi_t) \right]^{-1} (\Phi_{t-1})^T C_t.$$  (9)
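
The two regression estimators above can be compared on synthetic data. The following is a minimal sketch, not the implementation used in this work: the function names, the three-feature toy dynamics, and the noise model are all assumptions made for illustration. It generates costs that satisfy the Bellman relation in expectation, so the noisy next-state features act as an errors-in-variables regressor.

```python
import numpy as np

def lstd_ols(phi_prev, phi_next, costs, gamma):
    """Ordinary least squares Bellman error minimization (form of Equation 8)."""
    d = phi_prev - gamma * phi_next
    return np.linalg.solve(d.T @ d, d.T @ costs)

def lstd_iv(phi_prev, phi_next, costs, gamma):
    """Instrumental variables variant (form of Equation 9), with the
    previous post-decision features used as the instruments."""
    d = phi_prev - gamma * phi_next
    return np.linalg.solve(phi_prev.T @ d, phi_prev.T @ costs)

# Hypothetical toy data: three features and a known weight vector.
rng = np.random.default_rng(0)
n_samples, gamma = 5000, 0.95
theta_true = np.array([2.0, -1.0, 0.5])

phi_prev = rng.normal(size=(n_samples, 3))
phi_next_mean = 0.5 * phi_prev                               # assumed expected dynamics
phi_next = phi_next_mean + rng.normal(size=(n_samples, 3))   # noisy observed features

# Costs consistent with theta_true under the expected dynamics.
costs = (phi_prev - gamma * phi_next_mean) @ theta_true

theta_ols = lstd_ols(phi_prev, phi_next, costs, gamma)
theta_iv = lstd_iv(phi_prev, phi_next, costs, gamma)
err_ols = np.linalg.norm(theta_ols - theta_true)
err_iv = np.linalg.norm(theta_iv - theta_true)
print("OLS error:", err_ols, "IV error:", err_iv)
```

In this toy setting the noise in the observed next-state features is correlated with the OLS regressor, which biases the OLS estimate, while the instruments are uncorrelated with that noise.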

Approximate  Value  Iteration  (AVI). 

An alternative to API with a parametric modeling approach is the use of approximate
value iteration. As opposed to API, where policies are fixed and then evaluated,
AVI continually refines the value approximation and immediately updates the
current policy accordingly. Use of separable piecewise linear approximations within
an AVI framework has garnered significant attention due to their ability to map a
wide variety of value functions.

The  Separable  Projective  Approximation  Routine  (SPAR)  algorithm  (Powell,  2007) 
is  an  effective  piecewise  linear  approximation  method.  SPAR  samples  values  of  states 
and  uses  a  monotone  structure  to  update  the  value  function  approximation  for  many 
states  at  once.  Any  slope  approximations  that  violate  monotonicity  are  averaged 
with  the  updated  value  to  maintain  structure.  A  key  benefit  to  SPAR  is  its  proof  of 
convergence,  as  demonstrated  by  Powell  (2007). 

The Stochastic Hybrid Approximation Procedure (SHAPE) algorithm differs from
SPAR in that it uses a nonlinear approximation of the value function (Cheung &
Powell, 2000). This nonlinear approximation is updated by iteratively sampling the
stochastic system to update the gradients of the value function. For this algorithm to
be effective, a close initial approximation is critical. While not a separable piecewise
linear approximation, SHAPE is closely related to SPAR and other piecewise linear
value function approximation algorithms in that the algorithm uses stochastic gradient
sampling and takes advantage of a known problem structure to develop a value
approximation.

The  Concave,  Adaptive  Value  Estimation  (CAVE)  algorithm  is  very  similar  to 
the  SPAR  algorithm,  but  uses  a  different  technique  to  correct  monotonicity  violations 
and  can  be  applied  to  monotonicity  of  slopes  for  concave  value  functions  (Godfrey 
&  Powell,  2002).  Instead  of  using  averages  to  correct  monotonicity  or  concavity,  the 
CAVE  algorithm  expands  the  area  being  updated  by  the  new  information.  Though 
the  general  CAVE  algorithm  does  not  have  a  convergence  proof,  variations  of  the 
CAVE  algorithm  have  been  proven  to  converge  (Topaloglu  &  Powell,  2003;  Ahner  & 
Parson,  2014),  and  CAVE  has  been  shown  to  outperform  SPAR  for  many  problems 
(Godfrey  &  Powell,  2001).  CAVE  has  been  applied  with  significant  success  to  a 
number  of  high  dimension  weapon  target  allocation  and  resource  allocation  problems 
with  stochastic  demand. 

III.  Methodology  &  Problem  Formulation


3.1  MDP  Formulation 


The  purpose  of  this  formulation  is  to  provide  a  framework  for  an  executable  model 
to  find  the  best  possible  policy.  This  policy  is  found  by  minimizing  the  cost  of  the 
defined  objective  function.  This  objective  function  is  comprised  of  the  cost  of  the 
current  state  as  well  as  the  expected  future  cost  of  states  resulting  from  a  combination 
of  the  current  state  and  chosen  actions.  The  actions  selected  are  the  accessions  and 
promotion  decisions  for  the  following  year,  so  the  cost  associated  with  the  combination 
of  the  current  state  and  action,  typically  defined  as  C(S,x ),  is  not  affected  in  the 
present  by  the  action  x.  Thus,  the  current  contribution  is  defined  by  C(S),  and  the 
consequence  of  a  given  decision  is  captured  by  the  expected  value  of  future  states.  The 
objective  for  the  infinite  horizon  formulation  is  to  minimize  expected  total  discounted 
cost,  with  the  objective  function  defined  as: 


$$\min_{\pi \in \Pi} \mathbb{E}^{\pi} \left[ \sum_{t=0}^{\infty} \gamma^t C(S_t) \right],$$  (10)


where $\gamma$ is the discount factor, $\pi$ is a single policy, and $\Pi$ is the set of all possible
policies.

The  alternative  finite  horizon  objective  function  is  defined  as: 


$$\min_{\pi \in \Pi} \mathbb{E}^{\pi} \left[ \sum_{t=0}^{T} \gamma^t C(S_t) \right].$$  (11)


The  USAF  officer  sustainment  problem  is  constructed  such  that  a  single  set  of 
requirements  and  transition  rates  is  assumed  valid  for  any  future  time  period.  The 
parametric  modeling  formulation  of  the  system  is  constructed  as  an  infinite  time 
horizon  problem  with  annual  increments.  This  avoids  calculating  a  separate  value  for 
an  identical  state  in  a  different  time  period.  The  set  of  decision  epochs  is  denoted  as:


$$\mathcal{T} = \{1, 2, \ldots\}.$$  (12)

For  the  separable  piecewise  linear  formulation,  a  distinct  value  approximation  for  each 
decision  must  be  developed  for  each  time  period,  so  the  problem  is  constructed  as  a 
finite  horizon  problem  with  T  annual  increments: 


$$\mathcal{T} = \{1, 2, \ldots, T\}.$$  (13)


The  state  of  each  officer  in  the  system  is  defined  by  an  attribute  vector  a,  composed 
of  three  numerical  attributes.  These  attributes  are  numerical  indices  representing 
AFSC,  grade,  and  CYOS.  Due  to  the  grade  structure  described  in  Chapter  1,  this 
model  consolidates  the  three  initial  grades  into  one  index  of  the  grade  class  descriptor, 
leaving  four  grades  modeled.  Additionally,  the  AFSCs  are  limited  to  the  Line  of 
the  Air  Force  competitive  category.  This  group  consists  of  54  AFSCs  and  contains 
approximately  80%  of  the  officers  in  the  USAF.  The  excluded  AFSCs  include  medical, 
dental,  legal,  and  chaplain  career  fields,  whose  behavior  and  constraints  are  sufficiently 
different  to  warrant  a  separate  model. 

Let the attribute vector a denote a specified combination of these three attributes,
denoted as $a_1$, $a_2$, and $a_3$:

$$a = \begin{pmatrix} a_1 \\ a_2 \\ a_3 \end{pmatrix} = \begin{pmatrix} \text{AFSC} \\ \text{Grade} \\ \text{CYOS} \end{pmatrix}.$$  (14)

The individual sets containing all values of a single attribute $a_h$ are denoted as:

$$\mathcal{A}_h = \text{set of all possible officer attributes } a_h.$$  (15)

The full set, $\mathcal{A}$, for the general problem contains the full range of possible
combinations of m AFSCs, n grades, and q CYOSs. For the full problem examined, these
values are 54, 4, and 30, respectively.

$$\mathcal{A} = \text{set of all possible officer attribute vectors } a, \text{ where } |\mathcal{A}| = mnq.$$  (16)

The state of the USAF personnel system is defined by the number of resources
(officers) of each attribute vector, a. Denote $S_a$ as the number of officers possessing
the attributes defined by the attribute vector a.

$$S_{a,t} = \text{number of resources at time } t \text{ for attribute vector } a.$$  (17)

The pre-decision state is a vector of size $|\mathcal{A}|$, which is defined as $S_t = (S_{a,t})_{a \in \mathcal{A}}$,
expressed in vector form as:

$$S_t = [S_{1,1,1}, S_{1,1,2}, \ldots, S_{1,1,q}, S_{1,2,1}, \ldots, S_{1,n,q}, S_{2,1,1}, \ldots, S_{m,n,q}]^T,$$  (18)

where the subscript t is suppressed for each $S_{a_1,a_2,a_3}$ for notational simplicity.

There are m + n decisions made annually, defined by the set $\mathcal{D}$. Decisions
$\{1, 2, \ldots, m\}$ are the accession decisions for each AFSC, determining how many
people to recruit as junior officers. Decisions $\{m+1, m+2, \ldots, m+n-1\}$ are decisions
on the ratio of eligible officers to promote. For the full problem, m = 54 and n = 4.

The action selected is defined by:

$$X_t = (X_{d,t})_{d \in \mathcal{D}}.$$  (19)

Additionally, there are lower bounds, $\beta_d$, and upper bounds, $\zeta_d$, for each decision,
as specified by the decision maker. For the accession decisions, these bounds arise out
of pipeline considerations, such as training constraints or minimum training levels to
sustain facilities. For the promotion decisions, extremely high or low values can have
significant secondary effects on the quality of the force.

$$\beta_d \le X_{d,t} \le \zeta_d \quad \forall d \in \mathcal{D}.$$  (20)

A deterministic transition is made from pre-decision to post-decision state, $S_t^x$,
with each non-promotion eligible group moving to the next CYOS index or retiring.
Accessions fill the first year group (CYOS) index for each AFSC and promotion
policies are attached at the end, increasing the vector size by n - 1. The promotion
policies are included in the post-decision state because this policy determines the next
transition, but the information $\omega$ as to what stochastic result this policy will generate
has not been received yet. The states with the CYOS index associated with the first
year of a potential grade (i.e., new promotions) are set to zero, since these will be
filled by the stochastic promotion transitions. The post-decision state vector is then
defined by:


$$S_t^x = [X_1, S_{1,1,1}, S_{1,1,2}, \ldots, 0, S_{1,2,1}, \ldots, S_{1,n,q-1}, X_2, \ldots, S_{m,n,q-1}, X_{m+1}, \ldots, X_{m+n}].$$  (21)

Again, the subscript t is suppressed for each $S_{a_1,a_2,a_3}$ for notational simplicity.

The cost function, C, is defined by the total sum of the shortages by grade and
AFSC and the overages above the maximum number of personnel allowed in the
system (i.e., the end strength). To differentiate between the criticality of shortages for
various AFSCs, an AFSC criticality coefficient, $(b_{a_1})_{a_1 \in \mathcal{A}_1}$, is used to scale the shortage
cost for each AFSC. Requirements by AFSC ($a_1$) and grade ($a_2$) are denoted as
$(R_{a_1,a_2})_{(a_1,a_2) \in \mathcal{A}_1 \times \mathcal{A}_2}$ for all combinations of $a_1$ and $a_2$. Let F denote the maximum end
strength and let e denote the end strength criticality coefficient. The end strength
criticality coefficient allows weighting the relative importance of end strength and
shortages using decision maker preferences. Let $C_H$ and $C_E$ denote the cost due to
shortage and end strength, respectively. Then,


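$$C_H(S) = \sum_{a_1=1}^{m} \sum_{a_2=1}^{n} b_{a_1} \left( R_{a_1,a_2} - \sum_{a_3=1}^{q} S_{a_1,a_2,a_3} \right)^{+},$$  (22)

and

$$C_E(S) = e \left( \sum_{a_1=1}^{m} \sum_{a_2=1}^{n} \sum_{a_3=1}^{q} S_{a_1,a_2,a_3} - F \right)^{+}.$$  (23)

Let C(S) denote the total cost associated with a given state, which is simply the sum
of the two partial costs, $C_H$ and $C_E$:

$$C(S) = \sum_{a_1=1}^{m} \sum_{a_2=1}^{n} b_{a_1} \left( R_{a_1,a_2} - \sum_{a_3=1}^{q} S_{a_1,a_2,a_3} \right)^{+} + e \left( \sum_{a_1=1}^{m} \sum_{a_2=1}^{n} \sum_{a_3=1}^{q} S_{a_1,a_2,a_3} - F \right)^{+}.$$  (24)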
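
As an illustration of the cost structure described above (criticality-weighted shortages by AFSC and grade, plus a weighted overage above the maximum end strength), the following sketch evaluates the total cost for a state stored as an array. The function name, the array layout, and the example numbers are assumptions made for this example only.

```python
import numpy as np

def total_cost(S, R, b, F, e):
    """Total cost of a state under the shortage-plus-overage structure.

    S : (m, n, q) array of officer counts by AFSC, grade, and CYOS
    R : (m, n) array of requirements by AFSC and grade
    b : (m,) array of AFSC criticality coefficients
    F : maximum end strength
    e : end strength criticality coefficient
    """
    on_hand = S.sum(axis=2)                       # officers by AFSC and grade
    shortage = np.maximum(R - on_hand, 0)         # positive part of each shortfall
    cost_shortage = float((b[:, None] * shortage).sum())
    cost_end_strength = float(e * max(S.sum() - F, 0))
    return cost_shortage + cost_end_strength

# Hypothetical instance: one AFSC, two grades, two year groups.
S = np.array([[[3, 1], [2, 0]]])
R = np.array([[5, 1]])
b = np.array([2.0])
example_cost = total_cost(S, R, b, F=5, e=1.0)
print(example_cost)
```

In this hypothetical instance the first grade is one officer short (weighted by 2) and the system holds one officer above the end strength of 5, so both cost components are active.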


Promotion and retention transitions are treated as discrete stochastic functions,
each following a binomial distribution. The probability of any eligible individual
transitioning to the next highest grade is determined by the promotion decisions $X_d$
for $d \in \{m+1, m+2, \ldots, m+n-1\}$:

$$\Pr\left( S_{a_1,a_2+1,a_3} = j \mid S^x_{a_1,a_2,a_3} \right) = \binom{S^x_{a_1,a_2,a_3}}{j} (X_d)^{j} (1 - X_d)^{S^x_{a_1,a_2,a_3} - j}.$$  (25)

Those who are not selected for promotion remain in their current grade. After all
promotion transitions are complete, the probability of any individual transitioning
to the next year group (CYOS) within the individual's current grade and AFSC is
defined by the retention parameter, $p_a$, that can be derived from historic data and an
environmental parameter. This additional environmental parameter can be used to
scale historic retention rates to reflect beliefs about changing conditions in the future,
such as economic and operations tempo impacts to force retention.

$$\Pr(S_{a,t+1} = j) = \binom{S^x_{a,t}}{j} (p_a)^{j} (1 - p_a)^{S^x_{a,t} - j}.$$  (26)

Finally,  the  promotion  decisions  are  no  longer  necessary,  so  the  last  n  indices  of 
the  post-decision  state  are  dropped  as  the  transition  back  to  a  pre-decision  state  is 
completed. 
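
The binomial promotion and retention transitions described above can be simulated directly. This is a minimal sketch under assumed inputs: the function names, the group sizes, and the rates are hypothetical and are not taken from the USAF data.

```python
import numpy as np

def simulate_promotions(eligible, promotion_rate, rng):
    """Draw the number promoted from each eligible group (binomial draw).
    Returns (promoted, not_promoted); those not promoted stay in grade."""
    promoted = rng.binomial(eligible, promotion_rate)
    return promoted, eligible - promoted

def simulate_retention(counts, retention_rate, rng):
    """Draw how many officers in each group continue to the next year group."""
    return rng.binomial(counts, retention_rate)

rng = np.random.default_rng(1)
eligible = np.array([10, 0, 25])                 # hypothetical eligible groups
promoted, remaining = simulate_promotions(eligible, 0.4, rng)
retained = simulate_retention(remaining, np.array([0.9, 0.9, 0.8]), rng)
print(promoted, remaining, retained)
```

Promotion is drawn first and retention second, matching the order of the transitions in the text; an empty group simply yields zero in both draws.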

3.2  Approximate  Dynamic  Programming  Algorithms 

API:  Least  Squares  Temporal  Differences. 

Once the MDP is formulated, an iterative least squares approximate policy iteration
algorithm with ordinary least squares Bellman error minimization is implemented
to solve the formulation. Algorithm 1 is an adapted version
of the algorithm presented by Scott et al. (2014).

At each iteration, $\theta$ is smoothed according to a step size, $\alpha$. Experimentation
revealed that a linearly decreasing step size significantly outperforms a static step size. This
allowed rapid updates early in the algorithm while benefiting from a more refined
estimation as the algorithm converged to a solution.

Algorithm 1 LSTD Algorithm
1: Initialize $\theta$ as a vector of zeros.
2: for j = 1 to M (Policy Improvement Loop)
3:     Update $\alpha = (M - j + 1)/(5M)$
4:     for i = 1 to N (Policy Evaluation Loop)
5:         Simulate a random post-decision state $S^x_{t-1,i}$
6:         Simulate the transition to the pre-decision state $S_{t,i}$
7:         Solve MINLP for optimal decision $X^\pi(S_t \mid \theta) = \arg\min_x [C(S_t) + \gamma \theta^T \phi(S_t^x)]$
8:         Record $C(S_{t,i})$, $\phi(S^x_{t-1,i})$, and $\phi(S^x_{t,i})$
9:     End
10:    Compute $\hat{\theta} = [(\Phi_{t-1} - \gamma\Phi_t)^T (\Phi_{t-1} - \gamma\Phi_t)]^{-1} (\Phi_{t-1} - \gamma\Phi_t)^T C_t$
11:    Update $\theta = (\alpha)\hat{\theta} + (1 - \alpha)\theta$
12: End

The selected basis functions (features) are the interactions between decisions taken
to some power $\psi$ and the sums of the states $S_a$ for various combinations of attributes.
This selection helps the algorithm relate current states to potential actions and keeps
the problem decomposable. For any given pre-decision state, the inner minimization
problem becomes:

$$\min \; Z_1 X_1 + Z_2 X_1^2 + \ldots + Z_\psi X_1^\psi + Z_{\psi+1} X_2 + \ldots + Z_{\psi(m+n-1)} X_{m+n-1}^\psi,$$  (27)

subject to any pre-defined bounds on decisions, $\beta_d$ and $\zeta_d$. Each coefficient $Z_g$ is
determined by a number of current states and the parameter $\theta$. The inner minimization
problem is a large, decomposable mixed-integer nonlinear program with integer
decisions $X_{d,t}$ for all $d \le m$ and continuous decisions $X_{d,t}$ for all $d > m$.
Decomposability allows separation into m small integer nonlinear programs and n - 1 small
continuous nonlinear programs. Each of these problems is solved easily.
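
Because the inner problem separates by decision, each subproblem is a one-dimensional polynomial minimization over a bounded range. The sketch below shows one simple way such subproblems could be solved: enumeration for the integer accession decisions and a grid search for the continuous promotion decisions. The function names, the coefficients, and the choice of enumeration and grid search are assumptions for illustration; the solver actually used is not specified here.

```python
import numpy as np

def poly_cost(z, x):
    """Evaluate z[0]*x + z[1]*x**2 + ... for one decision variable."""
    powers = np.arange(1, len(z) + 1)
    return float(np.sum(z * np.power(float(x), powers)))

def solve_integer(z, lower, upper):
    """Minimize the polynomial over the integers in [lower, upper] by enumeration."""
    return min(range(lower, upper + 1), key=lambda x: poly_cost(z, x))

def solve_continuous(z, lower, upper, grid_points=1001):
    """Approximately minimize the polynomial over [lower, upper] on a uniform grid."""
    xs = np.linspace(lower, upper, grid_points)
    costs = [poly_cost(z, x) for x in xs]
    return float(xs[int(np.argmin(costs))])

# Hypothetical coefficients for one accession and one promotion decision.
best_accession = solve_integer(np.array([-12.0, 1.0]), 0, 50)    # -12x + x^2
capped_accession = solve_integer(np.array([-12.0, 1.0]), 0, 4)   # bound is active
best_promotion = solve_continuous(np.array([-1.0, 1.0]), 0.0, 1.0)
print(best_accession, capped_accession, best_promotion)
```

Enumeration is exact for the integer subproblems because each has a small bounded range; the grid search is only as accurate as its spacing and could be replaced by a one-dimensional nonlinear solver.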

As discussed in Chapter 2, instrumental variables have been demonstrated to
significantly improve regression performance. Scott et al. (2014) present this adaptation
to the LSTD algorithm with ordinary least squares regression by changing Step 10 in
Algorithm 1 to reflect the alternative regression equation. The modified algorithm is
presented as Algorithm 2:

Algorithm 2 LSTD Algorithm with Instrumental Variables
1: Initialize $\theta$ as a vector of zeros.
2: for j = 1 to M (Policy Improvement Loop)
3:     Update $\alpha = (M - j + 1)/(5M)$
4:     for i = 1 to N (Policy Evaluation Loop)
5:         Simulate a random post-decision state $S^x_{t-1,i}$
6:         Simulate the transition to the pre-decision state $S_{t,i}$
7:         Solve MINLP for optimal decision $X^\pi(S_t \mid \theta) = \arg\min_x [C(S_t) + \gamma \theta^T \phi(S_t^x)]$
8:         Record $C(S_{t,i})$, $\phi(S^x_{t-1,i})$, and $\phi(S^x_{t,i})$
9:     End
10:    Compute $\hat{\theta} = [(\Phi_{t-1})^T (\Phi_{t-1} - \gamma\Phi_t)]^{-1} (\Phi_{t-1})^T C_t$
11:    Update $\theta = (\alpha)\hat{\theta} + (1 - \alpha)\theta$
12: End

Many approximate policy iteration algorithm implementations use uniform random
sampling of possible post-decision states in Step 5. To improve the ability of the
parametric regression to separate the effects of each basis function (i.e., feature) on
the value function, Algorithm 3 uses Latin hypercube sampling (LHS) to generate
an improved set of post-decision states. LHS designs help ensure uniform sampling
across all possible dimensions, thereby improving the ability of the regression to
identify which regressors are significantly affecting the cost function (McKay et al., 1979).
The adapted algorithm with both instrumental variables and Latin hypercube
sampling is shown in Algorithm 3.


Algorithm 3 IVLSTD Algorithm with Latin Hypercube Sampling
1: Initialize $\theta$ as a vector of zeros.
2: for j = 1 to M (Policy Improvement Loop)
3:     Construct an LHS design of N post-decision states, $[S^x_{t-1,1}, S^x_{t-1,2}, \ldots, S^x_{t-1,N}]$
4:     Update $\alpha = (M - j + 1)/(5M)$
5:     for i = 1 to N (Policy Evaluation Loop)
6:         Identify the pre-defined post-decision state $S^x_{t-1,i}$
7:         Simulate the transition to the pre-decision state $S_{t,i}$
8:         Solve MINLP for optimal decision $X^\pi(S_t \mid \theta) = \arg\min_x [C(S_t) + \gamma \theta^T \phi(S_t^x)]$
9:         Record $C(S_{t,i})$, $\phi(S^x_{t-1,i})$, and $\phi(S^x_{t,i})$
10:    End
11:    Compute $\hat{\theta} = [(\Phi_{t-1})^T (\Phi_{t-1} - \gamma\Phi_t)]^{-1} (\Phi_{t-1})^T C_t$
12:    Update $\theta = (\alpha)\hat{\theta} + (1 - \alpha)\theta$
13: End
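
A Latin hypercube design of the kind used to pre-define the sampled post-decision states can be generated with a few lines of NumPy. This is a generic sketch of the technique, not the design code used in this work: the function name and the bounds are hypothetical, and a real post-decision state would additionally need its officer counts rounded to integers.

```python
import numpy as np

def latin_hypercube(n_samples, lower, upper, rng):
    """Generate an n_samples-point Latin hypercube design within [lower, upper].

    Each dimension is cut into n_samples equal strata, one point is drawn
    uniformly inside each stratum, and the strata are shuffled independently
    per dimension so that every stratum of every dimension is used once.
    """
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    dim = lower.size
    strata = np.arange(n_samples)[:, None]
    unit = (strata + rng.random((n_samples, dim))) / n_samples
    for j in range(dim):
        unit[:, j] = rng.permutation(unit[:, j])
    return lower + unit * (upper - lower)

rng = np.random.default_rng(2)
# Hypothetical two-dimensional design: a count in [0, 100] and a rate in [0, 1].
design = latin_hypercube(10, [0.0, 0.0], [100.0, 1.0], rng)
print(design)
```

Compared with independent uniform draws, every one-dimensional projection of the design covers its range evenly, which is the property that helps the regression separate the effects of individual features.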




AVI:  Concave  Adaptive  Value  Estimation. 


The  alternative  finite  horizon  formulation  of  the  problem  is  solved  using  a  version 
of  the  general  CAVE  algorithm  proposed  by  Godfrey  &  Powell  (2002).  Godfrey  & 
Powell  use  this  algorithm  to  develop  a  piecewise  linear  value  function  approximation 
for a single state, $V(s)$, for future costs, given the current state and action. We adapt
this convention to develop a value function approximation for each decision, $X_{dt}$,
which  is  equivalent  to  the  corresponding  portion  of  the  post-decision  state,  l  (,  for 
dt  =  di  as  well  as  the  promotion  decisions. 

This algorithm uses a series of breakpoints indexed by $k_{dt}$, where $k_{dt} \in \mathcal{K}_{dt}$, and
$\mathcal{K}_{dt} = \{0, 1, \ldots, k_{max}\}$. $k_{max}$ represents the maximum number of allowable breakpoints.
These breakpoints are denoted $(\nu^{k_{dt}}, u^{k_{dt}})$, where $\nu^{k_{dt}}$ describes the slope of a linear
segment projected from $u^{k_{dt}}$.


Figure  1.  CAVE  Piecewise  Linear  Value  Approximation 

The breakpoints $u^{k_{dt}}$ are ordered such that $u^0 = 0$ and each consecutive point is
monotonically increasing:

$$u^0 < u^1 < \ldots < u^{k_{max}}.$$  (28)

The presence of concavity in the problem structure indicates that the slopes are also
monotonically decreasing:

$$\nu^0 > \nu^1 > \ldots > \nu^{k_{max}}.$$  (29)

CAVE uses sampling of the gradients for each decision to improve its estimate of
the slope for that approximation. This sampling is accomplished by a single simulation
forward in time, calculating the sample gradients $\Delta^-_{dt}(X_{dt})$ and $\Delta^+_{dt}(X_{dt})$ for the
segments being evaluated, $k^-_{dt}$ and $k^+_{dt}$.


Figure  2.  CAVE  Gradient  Sampling 


A smoothing interval, $Q_{dt}$, for each dt is initially set based on upper and lower
interval size parameters, $\epsilon^+_{dt}$ and $\epsilon^-_{dt}$. This update interval is then expanded to correct
any concavity violations. If necessary, new breakpoints are inserted at the ends of
the update interval.


Figure  3.  CAVE  Gradient  Update


After all of these steps are accomplished, the minimum update interval parameters,
$\epsilon^-_{dt}$ and $\epsilon^+_{dt}$, can be decreased to allow the algorithm to create a more granular
approximation at the next time step. The step size, $\alpha$, can also be decreased as iterations
are completed to improve value approximation convergence. The CAVE algorithm as
adapted from Godfrey & Powell (2001) to the multiple decision, multiple time period
problem is as follows:

Algorithm 4 CAVE Algorithm
Step 1: Initialization
1: For each dt, let $\mathcal{K}_{dt} = \{0\}$, where $\nu^0_{dt} = 0$ and $u^0_{dt} = 0$.
2: Initialize parameters $\epsilon^-_{dt}$, $\epsilon^+_{dt}$, and $\alpha$.
3: for j = 1 to M
   Step 2: Collect Gradient Information
4:     For each dt, identify the policy specified by the current value function
       approximation, $X_{dt} \ge 0$.
5:     For all decisions simultaneously, sample the gradients $\Delta^-_{dt}(X_{dt}, \omega)$ and
       $\Delta^+_{dt}(X_{dt}, \omega)$ over a finite time horizon with random outcome $\omega \in \Omega$.
   Step 3: Define Smoothing Interval
6:     Let $k^-_{dt} = \min\{k_{dt} \in \mathcal{K}_{dt} : \nu^{k}_{dt} \le (1 - \alpha)\nu^{k+1}_{dt} + \alpha\Delta^-_{dt}(X_{dt})\}$.
7:     Let $k^+_{dt} = \max\{k_{dt} \in \mathcal{K}_{dt} : (1 - \alpha)\nu^{k-1}_{dt} + \alpha\Delta^+_{dt}(X_{dt}) \le \nu^{k}_{dt}\}$.
8:     Define the smoothing interval
       $Q_{dt} = \left[ \min\{X_{dt} - \epsilon^-_{dt}, u^{k^-}_{dt}\}, \; \max\{X_{dt} + \epsilon^+_{dt}, u^{(k^+ + 1)}_{dt}\} \right)$.
       If $u^{(k^+ + 1)}_{dt}$ is undefined, then set $u^{(k^+ + 1)}_{dt} = \infty$.
9:     Create new breakpoints at $X_{dt}$ and the endpoints of $Q_{dt}$ as needed. Since a
       new breakpoint always divides an existing segment, the segment slopes on
       both sides of the new breakpoint are the same initially.
   Step 4: Perform Smoothing
10:    For each segment in the interval $Q_{dt}$, update the slope according to
       $\nu^{k,new}_{dt} = \alpha\Delta_{dt} + (1 - \alpha)\nu^{k,old}_{dt}$, where $\Delta_{dt} = \Delta^-_{dt}(X_{dt})$ if $u^k_{dt} < X_{dt}$ and
       $\Delta_{dt} = \Delta^+_{dt}(X_{dt})$ otherwise.
11:    Adjust $\epsilon^-_{dt}$, $\epsilon^+_{dt}$, $\alpha$ according to step size rules.
12: End
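
The interval-expansion idea in Steps 3 and 4 can be illustrated with a deliberately simplified version of the update. The sketch below assumes unit-width segments on an integer grid (so no breakpoints are ever inserted), a single decision, and a left gradient that is at least as large as the right gradient; the function name and the example numbers are hypothetical. It is not the full multi-decision, multi-period algorithm.

```python
def cave_update(slopes, x, grad_minus, grad_plus, alpha, eps_minus=1, eps_plus=1):
    """One simplified CAVE smoothing step on unit-width segments.

    slopes[k] is the slope on [k, k+1) and is assumed non-increasing (concave).
    x is the integer decision at which gradients were sampled, grad_minus the
    left sample gradient and grad_plus the right one (grad_minus >= grad_plus).
    """
    v = list(slopes)
    n = len(v)
    lo = max(x - eps_minus, 0)     # first segment updated with grad_minus
    hi = min(x + eps_plus, n)      # one past the last segment updated with grad_plus

    # Segments left of x that would fall below their updated right neighbor.
    left = [k for k in range(0, x - 1)
            if v[k] <= (1 - alpha) * v[k + 1] + alpha * grad_minus]
    # Segments right of x that would rise above their updated left neighbor.
    right = [k for k in range(x + 1, n)
             if (1 - alpha) * v[k - 1] + alpha * grad_plus <= v[k]]
    if left:
        lo = min(lo, min(left))
    if right:
        hi = max(hi, max(right) + 1)

    # Smooth every slope inside the (possibly expanded) interval.
    for k in range(lo, x):
        v[k] = alpha * grad_minus + (1 - alpha) * v[k]
    for k in range(x, hi):
        v[k] = alpha * grad_plus + (1 - alpha) * v[k]
    return v

raised = cave_update([5, 4, 3, 2, 1], x=2, grad_minus=10, grad_plus=9, alpha=0.5)
lowered = cave_update([5, 4, 3, 2, 1], x=2, grad_minus=0, grad_plus=-5, alpha=0.5)
print(raised, lowered)
```

In the first call the large sampled gradients would lift the slopes near the decision above the untouched slope on the far left, so the interval expands leftward; in the second call the small gradients force an expansion to the right. In both cases the returned slopes remain non-increasing.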

IV.  Results


4.1  Defining  Model  Inputs  and  Measures 

Four performance measures provide an overview of each algorithm's level of success
for each of the three problem instances examined. For the problem instance
solved exactly with dynamic programming, results for the ADP algorithms and the
benchmark policy are reported in terms of mean percentage increase in cost over the
optimal policy values across all possible states. Mean relative optimality gap (MROG)
is reported for cost in the small problem instance. Mean relative increases
in shortages (MRIS), overages (MRIO), and squared deviation (MRISD) are also reported
to allow for more nuanced insight into the overall desirability of a given policy.
Modified forms of these measures are used to compare ADP policies to the benchmark
(i.e., current model) policies for problems large enough to require simulation to evaluate.
For these problem instances, results for the ADP algorithms are reported in terms
of percentage improvement (i.e., decrease in cost) over the benchmark policy. We
perform 50 simulations covering 50 years each, with each simulation beginning at an
optimal state (i.e., no shortages or overages). Half-widths are reported at the 95%
confidence level to establish statistical significance. Percent reduction in shortages
(RIS), percent reduction in overages (RIO), percent reduction in cost (RIC), and
percent reduction in total squared deviation (RSD) are reported. RIC, like MROG,
is a direct comparison of performance regarding the objective function.
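
As a rough illustration of how a percent reduction and its confidence half-width could be computed from paired simulation runs, consider the sketch below. The function names and the sample costs are hypothetical, pairing each ADP run with a benchmark run is an assumption, and a normal quantile is used for simplicity where a Student t quantile would be slightly wider for 50 replications.

```python
import math
import statistics

def percent_reduction(benchmark_costs, adp_costs):
    """Per-simulation percent reduction in cost relative to the benchmark."""
    return [100.0 * (b - a) / b for b, a in zip(benchmark_costs, adp_costs)]

def half_width(samples, confidence=0.95):
    """Half-width of a confidence interval for the mean (normal approximation)."""
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2.0)
    return z * statistics.stdev(samples) / math.sqrt(len(samples))

# Hypothetical costs from two paired simulation runs.
reductions = percent_reduction([100.0, 200.0], [90.0, 150.0])
example_half_width = half_width([1.0, 2.0, 3.0, 4.0, 5.0])
print(reductions, example_half_width)
```

Under this convention, a mean reduction whose interval (mean plus or minus the half-width) excludes zero would be read as statistically significant at the chosen confidence level.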

For implementations of the LSTD algorithm applied to the first two problem
instances, the outer and inner loop parameters are set to M = 30 and N = 10000. The
discount factor, $\gamma$, is set at 0.95. For each of the models, the end strength criticality
coefficient, e, and AFSC criticality coefficients, $(b_{a_1})_{a_1 \in \mathcal{A}_1}$, are set equal to one.
Additionally, the benchmark policy is calculated using the current HAF/A1 method.

For the CAVE algorithm, T is set to twice the maximum career length.
This helps correctly assess the overages and shortages at the end of the career for the
decisions being made. For the decisions at a time epoch to be reasonably representative
of the decisions that will actually be made in the future, the impacts of those
decisions are measured over a significant number of epochs. The decisions made near the
end of a finite time horizon model will be biased, since within the model only
a small number of years remain relevant, while the USAF has an enduring requirement
for officers. Thus, T must be significantly larger than the maximum career length in
order to obtain accurate and unbiased stochastic gradient samples.

4.2  Small  Problem  Instance  Definition  Results 

The small problem instance is formulated with a single AFSC (m = 1), single
grade (n = 1), and four year groups (q = 4). Additionally, the upper accession limit
is set at 6 ($\zeta_d = 6$ for d = 1). For the finite horizon CAVE algorithm, T = 8.

This  small  problem  allows  for  a  direct  comparison  of  ADP  and  benchmark  policies 
to  the  optimal  policy,  but  fails  to  replicate  a  significant  amount  of  the  complexity 
inherent  in  larger  instances  of  the  problem.  Additionally,  with  an  effective  career 
length  of  four  years,  random  deviation  is  less  likely  to  have  a  chance  to  compound  and 
create  significant  shortage  and  overage  costs  for  a  static  policy  compared  to  a  system 
in  which  the  effective  career  length  is  much  longer.  The  results  bear  this  assertion  out, 
with  the  benchmark  policy  beating  all  ADP  policies  tested  by  a  significant  margin. 

As  demonstrated  in  Table  1,  the  first  order  basis  functions  perform  poorly  for  all 
algorithm  variants  and  the  higher  order  basis  functions  generally  performed  better. 
Results  in  each  Table  1  are  shaded  on  a  color  scale  from  green  (good)  to  red  (bad). 
As  expected,  the  addition  of  instrumental  variables  and  Latin  hypercube  sampling 
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generally  improve  algorithm  performance,  with  the  exception  of  the  variant  using 
third  order  basis  functions.  Of  note,  repeated  runs  did  not  consistently  converge  to 
similar  solutions  in  terms  of  9  or  solution  quality,  which  points  to  potential  problems 
with  the  selected  basis  functions.  Given  that  good  solutions  consistently  performed 
well,  ten  runs  of  each  algorithm  were  performed,  and  the  best  performing  solution 
was  retained.  This  technique  was  utilized  for  the  larger  problem  instances  as  well. 
This  divergent  characteristic  did  introduce  significant  variance  in  the  performance  of 
the  LSTD  algorithms  throughout. 
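
The best-of-ten selection described above can be sketched as follows. This is a minimal illustration rather than the implementation used in this work: `train_policy` and `simulate_cost` are hypothetical placeholders for one run of an LSTD variant and for one simulation replication of the resulting policy, and the number of replications per policy (`n_reps`) is an assumed value.

```python
import random
from statistics import mean


def best_of_runs(train_policy, simulate_cost, n_runs=10, n_reps=30, seed=0):
    """Train n_runs policies and keep the one with the lowest mean simulated cost."""
    rng = random.Random(seed)
    best_policy, best_cost = None, float("inf")
    for _ in range(n_runs):
        policy = train_policy(rng)  # one independent run of the algorithm
        cost = mean(simulate_cost(policy, rng) for _ in range(n_reps))
        if cost < best_cost:
            best_policy, best_cost = policy, cost
    return best_policy, best_cost


# Toy stand-ins (hypothetical): a "policy" is a single accession level and
# the cost is the squared deviation from a target of 4 plus uniform noise.
def toy_train(rng):
    return rng.randint(0, 6)


def toy_cost(policy, rng):
    return (policy - 4) ** 2 + rng.random()
```

Because the retained policy is chosen on simulated cost, it should be re-evaluated on fresh replications before its performance is reported, otherwise the selection step biases the estimate downward.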

Table  1.  LSTD  Optimality  Gap  (Small  Problem  Instance) 


The  CAVE  algorithm  outperforms  each  of  the  LSTD  algorithm  variants,  as  shown 
in  Table  2.  Of  note  is  the  level  of  performance  of  the  non-objective  function  measures. 
Each  of  the  algorithms  generally  decreased  accessions,  leading  to  increased  numbers 
of  overages  and  decreased  numbers  of  shortages.  The  rationale  for  the  benchmark 
policy  is  demonstrated  clearly  by  the  benchmark  policy’s  high  level  of  performance. 


Table  2.  ADP  Optimality  Gap  (Small  Problem  Instance) 


Algorithm        MROG      MRIS      MRIO      MRISD
LSTD  OLS        26.57%    78.38%   -90.19%    14.19%
LSTD  IV         19.21%    64.53%   -82.92%     1.68%
LSTD  IV LHS     18.46%    54.32%   -62.34%     8.05%
CAVE             15.13%    39.27%   -30.28%     8.96%
Benchmark         1.53%   -22.03%    45.84%    22.12%

4.3  Medium  Problem  Instance  Definition  &  Results


A larger problem instance was examined to allow for a level of complexity that
could not be obtained with a problem that could be solved to optimality, while keeping
the problem simple enough to easily validate the performance of the various
algorithms being compared. This medium-sized problem instance was constructed
with four AFSCs (m = 4), three grades (n = 3), 15 CYOSs (q = 15), and an upper
accession limit of 50 accessions (Cd = 50 for d ≤ 4). With multiple grades, promotion
decisions are now assessed to determine transition rates from one grade to another.
We examine 30 time epochs when implementing the CAVE algorithm (T = 30).

As this problem instance was examined, repetitions of all variants of the LSTD
algorithm produced significantly different θs, with significantly different solution
qualities, just as observed in the small problem instance. Like the solutions from the small
problem, the difference in outcomes was a function of the policy produced, as those
solutions that performed well did so consistently. For each of the ten runs, the produced
policy was simulated to examine solution quality, and the run that performed
the best in terms of the objective function was selected. The best LSTD algorithm
results for each combination of sampling technique, regression technique, and basis
functions are shown in Table 3.

Table  3.  LSTD  Percentage  Improvement  from  Benchmark  (Medium  Problem  Instance) 


                   Random Sampling                                 Latin Hypercube Sampling
                   Ordinary Least Squares   Instrumental Variables   Instrumental Variables
Basis Functions    RIC         RSD          RIC        RSD           RIC         RSD
4th Order           -32.86%      -5.39%     -11.31%     -65.21%         8.29%      17.61%
3rd Order           -12.44%     -12.05%     -12.77%       5.44%        -8.97%     -46.37%
2nd Order           -12.94%      11.59%     -11.51%       4.83%         5.74%      21.88%
1st Order          -102.42%   -1025.40%     -90.53%    -867.50%      -277.51%   -4769.85%

Results that failed to improve on the benchmark results are shown in red, while
results that show some level of improvement are shaded from red to green according
to the quality of the solution. As expected, the use of instrumental variables improved
solution quality significantly. Latin hypercube sampling improved the solution
qualities for all sets of basis functions, but improved the solution quality of complex sets
of basis functions by a more significant margin than those with simpler sets. Given a
constant inner loop sample size, N, algorithm variants with smaller numbers of basis
functions have relatively more information per feature with which to determine which
features are driving the observed cost. The LHS design helps algorithms with those sets too,
but is at its most useful when the impact of these sample size limitations is
exacerbated by large numbers of semi-correlated features. The algorithm variants with
Latin hypercube sampling, instrumental variables, and either second or fourth order
basis functions are the only LSTD algorithms that are able to provide policies that
improve the total cost, though several other variants show improvements in squared
deviation.
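
The difference between purely random sampling and a Latin hypercube design can be illustrated with a short sketch. It is a generic construction, not the design generator used for these experiments: each dimension is split into as many equal-width strata as there are samples, one point is drawn per stratum, and the strata are paired at random across dimensions. The function names and the scaling helper are assumptions introduced for illustration.

```python
import random


def latin_hypercube(n_samples, n_dims, rng):
    """Return n_samples points in [0, 1)^n_dims, one point per stratum in every dimension."""
    columns = []
    for _ in range(n_dims):
        # One uniform draw inside each of the n_samples equal-width strata ...
        col = [(i + rng.random()) / n_samples for i in range(n_samples)]
        # ... then shuffle so that strata are paired at random across dimensions.
        rng.shuffle(col)
        columns.append(col)
    return [tuple(col[i] for col in columns) for i in range(n_samples)]


def scale_to_bounds(point, lower, upper):
    """Map a unit-cube point onto the box defined by per-dimension bounds."""
    return tuple(lo + u * (hi - lo) for u, lo, hi in zip(point, lower, upper))
```

With N samples, every one-dimensional projection of the design covers all N strata exactly once, which is the space-filling property that purely random samples of the same size do not guarantee.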

When  observing  the  policies  generated  by  the  algorithms,  it  becomes  apparent 
that  the  LSTD  algorithm  simplifies  the  problem  by  generating  solutions  that  are  only 
pseudo-dynamic.  In  effect,  at  least  one  of  the  decision  policies  remains  static,  while 
the  other  decision  policies  are  adjusted  higher  or  lower  based  on  the  levels  of  shortages 
or  overages  observed.  This  limitation  is  likely  due  to  the  value  function  being  unable 
to  project  onto  the  span  of  the  basis  functions,  though  additional  samples  may  have 
improved  this  problem. 
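
For reference, the two regression variants compared in this section can be written in the notation of Appendix A. The expressions below are the standard ordinary least squares and instrumental variables Bellman error minimization estimators from the ADP literature, stated here as an assumption about the form of the regression step rather than quoted from the implementation. The symbols $\Phi_{t-1}$ and $\Phi_t$ denote the matrices of basis function values at consecutive sampled post-decision states, $c_t$ the vector of observed costs, and $\gamma$ the discount factor.

```latex
\begin{align}
\hat{\theta}_{\mathrm{OLS}} &=
  \left[(\Phi_{t-1} - \gamma\Phi_t)^{\top}(\Phi_{t-1} - \gamma\Phi_t)\right]^{-1}
  (\Phi_{t-1} - \gamma\Phi_t)^{\top} c_t, \\
\hat{\theta}_{\mathrm{IV}} &=
  \left[\Phi_{t-1}^{\top}(\Phi_{t-1} - \gamma\Phi_t)\right]^{-1}
  \Phi_{t-1}^{\top} c_t .
\end{align}
```

In both cases the fitted value function is restricted to linear combinations of the chosen basis functions, so a value function that lies far from their span cannot be represented well regardless of the estimator or the number of samples.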

Table  4.  ADP  Percentage  Improvement  from  Benchmark  (Medium  Problem  Instance) 


Algorithm                            RIC       Half Width   RIS       Half Width   RIO        Half Width   RSD       Half Width
LSTD  Ordinary Least Squares         -12.44%    1.60%       -23.14%    3.08%         20.93%     7.86%      -12.05%    5.76%
LSTD  Instrumental Variables         -12.77%    1.30%       -23.84%    2.48%         21.98%     6.88%        5.44%    4.98%
LSTD  IV Latin Hypercube Sampling      8.29%    1.16%         8.86%    2.19%          0.65%     9.20%       17.61%    4.20%
CAVE  Accessions                       5.70%    1.05%        -5.52%    2.19%         48.96%     4.79%       35.24%    2.86%
CAVE  Accessions & Promotions          0.00%    1.81%        55.90%    1.69%       -190.47%    20.10%      -68.53%   11.66%


Two versions of CAVE were applied. The first utilizes the m = 4 accessions
decisions and the n - 1 = 2 promotion decisions, while the second algorithm only
models the m = 4 accessions decisions. The accessions-only variant of the CAVE
algorithm utilizes the benchmark policies for the n - 1 = 2 promotion decisions.
The  results  using  these  two  algorithms  are  shown  in  Table  4.  Allowing  the  promotion 
rates  to  vary  from  the  benchmark  is  a  relaxation  of  a  problem  constraint.  Though  the 
relaxation  of  a  constraint  indicates  that  the  variant  with  promotion  decisions  should 
be  able  to  outperform  the  more  constrained  accessions-only  model,  the  reverse  is 
observed.  This  can  be  attributed  to  a  high  level  of  interaction  between  the  promotion 
and  accession  policies  that  inhibits  CAVE’s  ability  to  converge  to  a  quality  or  optimal 
solution.  This  is  a  significant  weakness,  given  that  non-linear  interactions  also  exist 
between  accession  policies. 

The  LSTD  algorithm  with  fourth  order  basis  functions,  instrumental  variables 
Bellman  error  minimization,  and  Latin  hypercube  sampling  outperformed  all  other 
algorithms  tested.  Additionally,  this  variant  showed  a  statistically  significant  decrease 
in  shortages  and  a  statistically  insignificant  decrease  in  overages,  meaning  that  this 
improvement  was  accomplished  due  to  the  dynamic  nature  of  the  solution  without 
detrimentally  impacting  overages  or  shortages.  The  CAVE  algorithm  with  accession 
policies also outperformed the benchmark policy by a statistically significant margin,
with comparable performance to the second order basis function, instrumental
variable,  Latin  hypercube  variant. 
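
The significance statements above can be read directly from the half widths reported in Table 4: an improvement is treated as statistically significant when the interval formed by the improvement plus or minus its half width excludes zero. The sketch below shows that check together with one way of computing a half width from replications. The 95% confidence level and the normal approximation are assumptions made for illustration; the confidence level and distribution used for the reported half widths are not restated in this section.

```python
from statistics import NormalDist, mean, stdev


def mean_and_half_width(samples, confidence=0.95):
    """Sample mean and confidence-interval half width (normal approximation)."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    return mean(samples), z * stdev(samples) / len(samples) ** 0.5


def is_significant(improvement, half_width):
    """True when the interval improvement +/- half_width excludes zero."""
    return abs(improvement) > half_width


# Values taken from Table 4 (percent improvement, half width).
lhs_shortages = is_significant(8.86, 2.19)   # RIS for LSTD IV LHS
lhs_overages = is_significant(0.65, 9.20)    # RIO for LSTD IV LHS
cave_cost = is_significant(5.70, 1.05)       # RIC for CAVE, accessions only
```

Under this reading the LSTD variant with Latin hypercube sampling shows a significant change in shortages but not in overages, which matches the discussion above.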

4.4  LAF  Full  Problem  Definition  &  Results 

For  the  large  problem  instance,  the  most  promising  algorithms,  LSTD  with  2nd 
and  4th  order  basis  functions  and  CAVE  with  accessions  only,  were  applied  to  a 
problem  of  identical  size  to  the  USAF  problem.  This  problem  is  formulated  with  54 
AFSCs  (m  =  54),  four  grades  (n  =  4),  and  30  CYOSs  (q  =  30).  For  the  CAVE 
algorithm,  T  =  60  due  to  the  maximum  30  year  career  length.  Three  behavioral 
profiles  were  generated  to  represent  significant  differences  among  observed  behaviors
of  different  AFSCs,  including  a  high  retention  profile,  a  low  retention  profile,  and  a 
standard  profile.  These  profiles  represent  the  varying  levels  of  demand  for  the  skill 
sets  of  different  career  fields  within  the  USAF.  Each  AFSC  was  assigned  to  one  of 
these  profiles,  then  randomly  increased  or  decreased  in  size  according  to  a  uniform 
random  distribution.  Uniformly  distributed  unbiased  variance  was  then  introduced 
to  the  retention  rates  of  the  AFSC’s  behavioral  profile  to  generate  career  fields  that 
are  similar,  but  not  identical.  This  procedure  creates  a  heterogeneous  mix  of  AFSCs 
with  different  sizes  and  retention  rates. 
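
The generation procedure described above can be sketched as follows. All numeric values are hypothetical placeholders: the base retention rates per profile, the size multiplier range, and the noise width are not the values used in this work, and the random assignment of AFSCs to profiles is an assumption, since the text does not state how the assignment was made.

```python
import random

# Hypothetical base retention rates (probability of remaining for another year).
PROFILES = {"high": 0.95, "standard": 0.90, "low": 0.85}


def generate_afscs(m=54, q=30, size_range=(0.5, 1.5), noise=0.02, seed=0):
    """Build m career fields, each with a profile, a size multiplier, and q retention rates."""
    rng = random.Random(seed)
    afscs = []
    for _ in range(m):
        profile = rng.choice(sorted(PROFILES))   # assumed: random profile assignment
        scale = rng.uniform(*size_range)         # uniform increase or decrease in size
        base = PROFILES[profile]
        # Zero-mean uniform perturbation per CYOS, clipped to a valid probability.
        retention = [
            min(1.0, max(0.0, base + rng.uniform(-noise, noise))) for _ in range(q)
        ]
        afscs.append({"profile": profile, "scale": scale, "retention": retention})
    return afscs
```

Fixing the seed makes the generated mix of career fields reproducible, so that every algorithm is compared on the same heterogeneous problem instance.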

Table  5.  ADP  Percentage  Improvement  from  Benchmark  (Large  Problem  Instance) 


Algorithm            RIC       Half Width   RIS       Half Width   RIO        Half Width   RSD       Half Width
LSTD  2nd Order      -49.89%    0.79%       -49.80%    0.80%        -2837%     3791.96%    -41.93%    1.50%
LSTD  4th Order      -61.47%    0.87%       -30.30%    0.71%       -63228%     8651.60%     -779%    35.99%
CAVE  Accessions       2.82%    0.90%       -32.68%    2.88%         95.53%       3.21%      1.97%    1.09%


As  shown  in  Table  5,  none  of  the  LSTD  algorithms  tested  could  improve  upon  the 
benchmark solution. In addition to the LSTD algorithm failing to generate policies
that outperform the benchmark, the subjective quality of the solutions was low.
Many  observed  policies  were  stationary  over  the  simulated  time,  indicating  that  the 
algorithm  was  unable  to  map  the  value  function  closely  enough  to  modify  the  policy 
dynamically,  given  the  number  of  observations.  This  reinforces  the  earlier  observation 
that  the  value  function  does  not  appear  to  project  onto  the  span  of  the  basis  functions 
selected.  Selection  of  alternate  basis  functions  may  improve  this  solution  quality. 

The CAVE algorithm demonstrates a statistically significant improvement over
the benchmark policy. For this problem instance, the total overages were reduced
by decreasing the number of accessions for a small number of AFSCs by one or
two. Overages above the maximum allowable end strength were nearly eliminated, though
shortages increased significantly. Further analysis with alternative input parameters
should demonstrate potential shortage and overage trade-offs by adjusting the end
strength criticality coefficient and the AFSC criticality coefficients.


4.5  Computational  Performance 


In order to accurately compare computational load, all computational time
results are reported based on algorithm performance on a dual Intel Xeon E5-2650v2
workstation with 192 GB of RAM and MATLAB's parallel computing toolbox using
32 threads.


Table  6.  Computation  Times  (secs) 


                                             Small      Medium                   Large
Algorithm                 Basis Functions    Solution   Solution   Simulation    Solution   Simulation
LSTD Random Sample        1st Order          57.55      242        260           -          -
LSTD Random Sample        2nd Order          62.62      271        289           -          -
LSTD Random Sample        3rd Order          64.44      294        312           -          -
LSTD Random Sample        4th Order          67.09      318        336           -          -
LSTD Latin Hypercube      1st Order          58.45      911        929           -          -
LSTD Latin Hypercube      2nd Order          66.05      924        943           118612     118772
LSTD Latin Hypercube      3rd Order          66.42      935        953           -          -
LSTD Latin Hypercube      4th Order          70.96      945        962           135799     136970
CAVE                      -                  -          0.58       1441          -          -
CAVE (Accessions Only)    -                  0.13       0.38       953           67         167719


Table  6  shows  significant  fluctuations  in  computation  time  associated  with  the 
number  of  basis  functions  and  sampling  method  utilized.  Additionally,  LHS  improved 
the  quality  of  the  regression,  but  with  a  significant  computational  cost.  This  is  due  to 
the  significant  computational  effort  required  to  generate  LHS  designs  for  large  sample 
sizes. 

While  both  algorithms  result  in  reasonable  computation  times,  the  allocation  of 
time  is  significantly  different.  Table  6  shows  that  approximate  policy  iteration  with 
LSTD  requires  a  significant  computational  effort  to  develop  values  for  the  parameter, 
θ, but can then generate a solution for any given state very quickly. CAVE can
solve for a specific state much more quickly without having to develop parameters
to describe values for all possible states. However, this process must be repeated for
any given state evaluated, so simulating a large number of states eventually increases
computation time to surpass that of the LSTD algorithm. Figure 4 shows this trade-off
for the large problem instance.

Figure  4.  Computation  Load  for  Simulation  Years
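
The trade-off between a large one-time fitting cost and a small per-state cost, discussed in Section 4.5, can be illustrated with a simple linear cost model. The model and the numbers in the example are assumptions chosen for illustration, not measurements from Table 6: total time is taken to be a fixed cost plus a constant cost per evaluated state.

```python
def total_time(fixed, per_state, n_states):
    """Total computation time under a linear model: fixed cost plus per-state cost."""
    return fixed + per_state * n_states


def break_even_states(fixed_a, per_state_a, fixed_b, per_state_b):
    """Number of evaluated states at which methods A and B take equal time.

    Returns None when the two cost lines never cross for a positive number of states.
    """
    if per_state_a == per_state_b:
        return None
    n = (fixed_b - fixed_a) / (per_state_a - per_state_b)
    return n if n > 0 else None


# Hypothetical values: a CAVE-like method with no fitting cost and 2.0 s per state,
# and an LSTD-like method with 1000 s of fitting and 0.1 s per state.
n_star = break_even_states(0.0, 2.0, 1000.0, 0.1)
```

Below the break-even count the method with the lower fixed cost is cheaper; above it, the method with the lower per-state cost is cheaper.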

V.  Conclusions


5.1  Conclusions 

The  poor  solution  quality  and  inconsistent  convergence  of  the  LSTD  algorithm 
indicate  that  the  basis  functions  selected  are  not  appropriate  for  this  problem.  The 
successes  of  the  algorithm  in  improving  policies  for  the  medium  problem  indicate  that, 
despite  these  observed  limitations,  these  basis  functions  may  still  be  able  to  provide 
improved solutions for the large problem instance with a very large number of
observations and significant additional computational resources. Additionally, these
basis functions seem to be somewhat related to the true set of basis functions, given
that consistent improvement in performance was observed when applying instrumental
variables and space filling designs. These improvements indicate that additional
information  within  the  context  of  these  basis  functions  is  beneficial. 

Even  with  an  acceptable  set  of  basis  functions,  there  is  a  trade-off  in  sample  size 
and  sample  efficiency  that  warrants  further  examination.  For  sample  sizes  that  are 
large  enough  to  map  a  problem  of  this  size,  the  computational  efficiency  of  Latin 
hypercube  designs  must  be  evaluated  according  to  the  level  of  improvement  provided 
in  calculating  the  regression  equation  and  the  computational  demand  of  the  design 
generation.  This  improvement  and  computational  cost  must  be  compared  to  the 
improvement  and  computational  cost  of  additional  purely  random  samples,  given  that 
these  can  improve  solution  quality  without  requiring  design  generation.  Given  the 
significant  computational  burden  of  generating  very  large  designs,  both  in  processing 
time  and  memory,  case  by  case  analysis  is  required  in  order  to  determine  whether 
the  addition  of  a  Latin  hypercube  design  is  more  beneficial  than  a  computationally 
comparable  number  of  additional  samples. 

CAVE has been shown to converge to optimal solutions for problems that do not
have some form of non-linearity due to interaction between decisions (Topaloglu &
Powell, 2003). Such problems present a single optimum. In this problem, each stochastic
gradient sampled is only accurate given the current values of all other decisions being
examined. This is a common problem for algorithms that utilize purely on-policy
search to explore a decision space that has multiple local optima. CAVE is able to
maximally exploit the information to find a local optimum but does not have an internal
mechanism to explore alternative solutions. This explains why problem instances with
significant  interactions,  such  as  problems  that  include  promotion  decisions,  suffer 
decreased  levels  of  performance.  In  the  USAF  officer  sustainment  problem,  these 
interactions  exist  between  accessions  and  promotion  decisions,  simultaneous  accession 
decisions  for  different  career  fields,  and  decisions  made  at  different  time  epochs.  The 
algorithm’s  ability  to  improve  solutions  despite  this  non-linearity  provides  an  idea 
of  how  much  improvement  may  be  obtained  with  a  form  of  this  algorithm  that  is 
modified  to  overcome  the  limitations  of  CAVE  when  applied  to  this  type  of  problem 
structure. 

5.2  Future  Work 

To improve the policies generated by CAVE, two immediate solutions are apparent.
The first is the addition of some form of off-policy search. The use of a metaheuristic
hybrid of algorithms such as tabu search, a genetic algorithm, or GRASP
to discover alternative starting locations may overcome this limitation. CAVE is then
able to refine this location and converge to an optimum. With sufficient sampling, this
form  of  the  algorithm  is  likely  to  enjoy  a  proof  of  convergence,  given  the  proofs  of 
convergence  for  specific  instances  of  the  CAVE  algorithm  and  many  of  the  heuristic 
search  algorithms. 
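
The idea of pairing an exploratory search with a local refiner can be sketched on a toy problem. Nothing below is CAVE or a metaheuristic from the literature: `local_descent` is a hypothetical stand-in for the local refinement step, a fixed list of starting points stands in for the starting locations a metaheuristic would propose, and `toy_cost` is an invented one-dimensional cost with two local minima.

```python
def local_descent(cost, x, lower, upper):
    """Move by +/-1 while the cost strictly improves (stand-in for a local refiner)."""
    while True:
        neighbors = [n for n in (x - 1, x + 1) if lower <= n <= upper]
        best = min(neighbors, key=cost)
        if cost(best) >= cost(x):
            return x  # no neighbor improves: x is a local optimum
        x = best


def multi_start(cost, starts, lower, upper):
    """Refine each starting point locally and keep the best local optimum found."""
    return min((local_descent(cost, s, lower, upper) for s in starts), key=cost)


# Toy cost with local minima at x = 3 (cost 5) and x = 15 (cost 1).
def toy_cost(x):
    return min((x - 3) ** 2 + 5, (x - 15) ** 2 + 1)
```

Starting only from x = 0, the local step stops at the inferior optimum x = 3; supplying additional starting points such as 10 or 20 lets the same local step reach x = 15.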

The  second  solution  to  the  limitations  of  the  CAVE  algorithm  for  this  type  of 
problem is to separate the approximation of the overage and shortage functions. The
sampled stochastic gradient of the shortage function (marginal return) is then used in
conjunction with the sampled stochastic gradient of the overage function (marginal
cost) to evaluate which policies to adjust. This allows for the cost minimization
program to correctly ascertain the impact of increasing accessions in more than one
AFSC at a time. The current approach can result in many decisions changing
simultaneously based on sampled information regarding overages that is no longer relevant
after a single change to policy. The proposed separation allows for a policy modification to
address relatively important shortages while avoiding massive over-corrections to
policies. However, this adaptation will not address nonlinearities due to the interactions
of  decisions  made  at  separate  time  epochs. 

To improve the policies generated by the approximate policy iteration algorithm,
further effort in redefining these basis functions or investigating alternative regression
techniques is likely to be more productive than devoting the computational resources
to refine the parameter θ for the current implementation. Discovering a correct set of
basis functions is historically difficult for complex stochastic problems. Additionally,
a set of correct basis functions for a problem of this complexity may be
prohibitively large. While further trial and error may improve this algorithm, the
most promising avenue for success in an approximate policy iteration framework is likely
to be a non-parametric technique such as kernel regression.
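
As a pointer to what such a non-parametric approach might look like, the sketch below implements Nadaraya-Watson kernel regression, one common form of kernel regression, for estimating a value at a queried state from sampled states and observed values. It is a generic illustration and not a proposed design for the USAF problem: the Gaussian kernel, the Euclidean distance between state vectors, and the single bandwidth parameter are all assumptions.

```python
import math


def gaussian_kernel(u):
    """Unnormalized Gaussian kernel; the normalizing constant cancels in the ratio."""
    return math.exp(-0.5 * u * u)


def nadaraya_watson(x_query, xs, ys, bandwidth=1.0):
    """Kernel-weighted average of observed values ys at sampled states xs."""
    weights = [gaussian_kernel(math.dist(x_query, x) / bandwidth) for x in xs]
    total = sum(weights)
    if total == 0.0:
        raise ValueError("query is too far from all samples for this bandwidth")
    return sum(w * y for w, y in zip(weights, ys)) / total
```

No basis functions are specified in advance; the estimate is driven by nearby samples, at the cost of storing the samples and evaluating a weighted sum for every queried state.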

VI.  Appendix


6.1  Appendix  A 


Table  7.  Variables  Defined 


Symbol      Definition
x           Selected action
t           Time index
S_t         Pre-decision state at time t
γ           Discount factor
S_t^x       Post-decision state at time t
f           Individual basis function or feature
ℱ           Set of basis functions or features
θ           Vector of weights associated with the selected basis functions
φ           Vector of basis function values for a given post-decision state
N           Number of policy evaluation loops or samples
Φ           Matrix of recorded φs for N sampled post-decision states
c_t         Vector of recorded costs for N sampled post-decision states
T           Maximum number of time epochs
π           A defined policy
Π           The set of all possible policies
𝒯           The set of all time epochs
a_1         AFSC or career field
a_2         Grade
a_3         CYOS or year group
a           Vector of attributes a_1, a_2, and a_3
h           Attribute index
A_h         Set of all possible attributes a_h
A           Set of all possible attributes
m           Number of AFSCs or career fields modeled
n           Number of grades modeled
q           Number of CYOSs or year groups modeled
S_a,t       Number of resources at time t for attribute vector a
d           Decision index
𝒟           Set of all possible decisions
Pd          Upper bounds for decisions
Cd          Lower bounds for decisions
b_a1        AFSC criticality coefficient
R_a1,a2     Requirements by AFSC and grade
F           Maximum allowable number of personnel in system (end strength)
e           End strength criticality coefficient
j           Transition index
Pa          Retention parameter
j           Policy improvement index
M           Maximum number of policy improvement loops
i           Policy evaluation index
α           Stepsize parameter
ψ           The maximum order of the polynomial used for a set of basis functions
a           Combination of decision and polynomial
Zg          Coefficient for a decision, polynomial combination within MINLP
k_dt        Breakpoint index
K_dt        Set of breakpoints
k_max       Maximum number of allowable breakpoints
V_kdt       Slope to the right of breakpoint k_dt
U_kdt       Projection origin for breakpoint k_dt
Adt         Observed sample gradient
Qdt         Smoothing interval
_           Interval size parameter